-
Notifications
You must be signed in to change notification settings - Fork 32
🐛🎨⚗️Computational backend stability: improvements step 1 #8323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛🎨⚗️Computational backend stability: improvements step 1 #8323
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #8323 +/- ##
==========================================
- Coverage 87.83% 87.83% -0.01%
==========================================
Files 1945 1945
Lines 75397 75430 +33
Branches 1311 1311
==========================================
+ Hits 66226 66254 +28
- Misses 8776 8781 +5
Partials 395 395
Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
aa2ee3e to
5fa7dac
Compare
d9c1e1b to
8a082ac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR improves the stability of the computational backend by addressing timeout issues when retrieving task results from the dask-scheduler. Previously, when task result retrieval timed out, tasks would be immediately marked as FAILED even if they were actually successful. The new implementation introduces a retry mechanism that waits for a configurable period before marking tasks as failed.
Key changes:
- Implements retry logic for task result retrieval with timeout handling
- Adds new configuration setting for maximum wait time when retrieving results
- Refactors task result processing to handle different error types appropriately
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
services/director-v2/src/simcore_service_director_v2/modules/comp_scheduler/_scheduler_dask.py |
Major refactoring of task result processing logic with new error handling and retry mechanisms |
services/director-v2/src/simcore_service_director_v2/core/settings.py |
Adds new configuration setting for maximum wait time when retrieving results |
services/director-v2/src/simcore_service_director_v2/modules/dask_client.py |
Updates logging and removes get_cluster_details method |
services/director-v2/tests/unit/with_dbs/comp_scheduler/test_scheduler_dask.py |
Adds comprehensive test for the new retry behavior |
services/director-v2/tests/unit/with_dbs/comp_scheduler/conftest.py |
Adds fixture for testing the new timeout configuration |
services/director-v2/tests/unit/test_modules_dask_client.py |
Removes test for deleted get_cluster_details functionality |
packages/dask-task-models-library/src/dask_task_models_library/plugins/task_life_cycle_worker_plugin.py |
Adds type assertion for worker instance |
services/director-v2/src/simcore_service_director_v2/core/settings.py
Outdated
Show resolved
Hide resolved
services/director-v2/src/simcore_service_director_v2/modules/comp_scheduler/_scheduler_dask.py
Outdated
Show resolved
Hide resolved
services/director-v2/tests/unit/with_dbs/comp_scheduler/test_scheduler_dask.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thx so much!
services/director-v2/src/simcore_service_director_v2/modules/comp_scheduler/_scheduler_dask.py
Show resolved
Hide resolved
services/director-v2/src/simcore_service_director_v2/modules/dask_client.py
Outdated
Show resolved
Hide resolved
4af263a to
45e2c7d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks a lot for the effort. I think this PR should close/resolve #7975 and ITISFoundation/osparc-issues#1952.
services/director-v2/src/simcore_service_director_v2/modules/comp_scheduler/_scheduler_dask.py
Outdated
Show resolved
Hide resolved
0cbd6d7 to
366ac4b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks
…ings.py Co-authored-by: Copilot <[email protected]>
…cheduler_dask.py Co-authored-by: Copilot <[email protected]>
…omp_scheduler/_scheduler_dask.py Co-authored-by: Copilot <[email protected]>
a11cf98 to
19a833b
Compare
|



What do these changes do?
When the dask-scheduler is not responding in time due to timeouts the director-v2 would set the task as FAILED even when it is actually a success. This triggers this kind of error messages:

This PR changes that behavior by retrying to get the task results several times instead. it also introduces a new constant to define that time which is currently hard-coded (aka no new ENV).
This PR is the first step to improve the computational backend stability since a while. There will be following PRs to tackle:
Related issue/s
How to test
Dev-ops